Power Law Distributions: Zipf's Law

(Note: Be sure to either download the data that goes with this notebook or uncomment/edit the lines below needed to get your own data before executing the code in this notebook.)

It turns out that "power law" distributions are ubitquitous in nature, see:

1 http://arxiv.org/abs/cond-mat/0412004

2 https://arxiv.org/abs/1812.09662

In the next feww cells of this notebook I measure the number of files of different sizes on my hard disk. Then I proceed to quantify (in the simples possible way) the distribution of file sizes in a certain range as a power law distribution.


In [1]:
import numpy as np
import matplotlib.pyplot as pl

import pandas as pd                    # pandas is the "python data analysis package"
import os                              # operating system services
from scipy.optimize import curve_fit   # non-linear curve fitting

In [2]:
def getFileSizes(folder, minsize=2000, maxsize=10**9):
    """
    return a list of file sizes in a folder
    """
    sizes = []
    for (path, dirs, files) in os.walk(folder):
        for file in files:
            try:
                filename = os.path.join(path, file)
                size=os.path.getsize(filename)
                if size>minsize and size<maxsize:
                    sizes.append(size)
            except OSError:
                pass
    return sizes

In [3]:
if 0: # set this to "1" to actaully collect data, but edit the paths to match your computer.
    minsize=10**4  # the minimum file size to consider
    maxsize=10**10 # the maximum file size to consider
    sizes = getFileSizes('/Users/steve/Desktop',minsize,maxsize) # <--- edit this to match some folder with a lot of files
    print(len(sizes), 'files between', minsize, 'and', maxsize)

In [4]:
if 0: # set this to "1" to write the file
    df=pd.DataFrame({'file_sizes':sizes})
    df.to_csv('myData.csv', sep=',')
else: # read the file
    df=pd.read_csv('myData.csv')
    df.head()
    sizes = df.file_sizes

In [6]:
ns,bins,patches = pl.hist(np.log(np.array(sizes)+1),40)



In [8]:
xvals=(bins[:-1]+bins[1:])/2.0
counts=ns+0.5
yvals=np.log(counts)
pl.title("distribution of file sizes")
pl.ylabel("log(count)")
pl.xlabel("log(size)")
pl.plot(xvals,yvals,'b.')


Out[8]:
[<matplotlib.lines.Line2D at 0x128915d90>]

Uncertainties in the Parameters

There are a variety of techniques for estimating the uncertainty in the parameters you compute. The approach we'll take is one of the more "brute force" approaches, but it's pretty robust. We're going to fabricate data that has the same statistics as our original data (do the degree that we know and/or care). Then we'll fit the fabricated data many times (1000 times in this case) and we'll look at the statistics of the fit parameters produced. From this we can tell how well our fit has done at narrowing down the parameters we computed.


In [11]:
#
# Just to see same data fit using the non-linear scipy package "curve_fit":
#

def fLinear(x, m, b):
    return m*x + b

m0 = (yvals[-1]-yvals[0])/(xvals[-1]-xvals[0])  # estimate slope
b0 = yvals[0] - m0*xvals[0]                     # estimate intercept
sigma=1.0/np.sqrt(np.abs(counts))

popt, pcov = curve_fit(fLinear, xvals, yvals, p0=(m0, b0), sigma=sigma)

m=popt[0]          # slope
dm=np.sqrt(pcov[0,0]) # sqrt(variance(slope))
b=popt[1]          # int
db=np.sqrt(pcov[1,1]) # sqrt(variance(int))
ystar=fLinear(xvals, m, b)

N=1000    # number of samples to take
mList=[]  # keep track of monte-carlo fit parameters
bList=[]

for i in range(N):
    """
    Generate mc data with the same statistical properties as the real data.
    Repeat the fit for each set, and record the parameters.
    """
    mcY = ystar + sigma*np.random.normal(size=len(xvals))  # generate fabricated data to fit
    mcpopt, mcpcov = curve_fit(fLinear, xvals, mcY, p0=(m,b), sigma=sigma)
    mList.append(mcpopt[0])  # store the fit paramters for the fab-data
    bList.append(mcpopt[1])

In [13]:
# check slope histogram

pl.hist(mList)


Out[13]:
(array([ 13.,  29.,  62., 144., 198., 217., 183., 111.,  34.,   9.]),
 array([-0.42226101, -0.41820593, -0.41415086, -0.41009578, -0.40604071,
        -0.40198563, -0.39793055, -0.39387548, -0.3898204 , -0.38576532,
        -0.38171025]),
 <a list of 10 Patch objects>)

In [14]:
# check intercept histogram

pl.hist(bList)


Out[14]:
(array([ 10.,  34., 114., 175., 222., 205., 134.,  69.,  26.,  11.]),
 array([10.02641787, 10.07381526, 10.12121264, 10.16861002, 10.21600741,
        10.26340479, 10.31080218, 10.35819956, 10.40559694, 10.45299433,
        10.50039171]),
 <a list of 10 Patch objects>)

In [17]:
#
# Compute the statistics of the mc-results
#

marr=np.array(mList)
barr=np.array(bList)

mAvg = marr.mean()
bAvg = barr.mean()
sigM = marr.std()
sigB = barr.std()

pl.errorbar(xvals, yvals, sigma, fmt='r.') 
pl.title("File Sizes on My HD")
pl.ylabel("log(count)")
pl.xlabel("log(size)")
pl.plot(xvals,ystar, 'g-')
print("Slope=", m, '+/-', sigM, "(", m-sigM,",",m+sigM, ")")
print("Intercept=", b, '+/-', sigB, "(", b-db,",",b+sigB, ")")

print() # print a blank link
print("Compare to cov-matrix for fun:")
print("sqsrt(pcov[0,0]) (should be sigma m)", np.sqrt(pcov[0,0]))
print("sqrt(pcov[1,1]) (should be sigma b)", np.sqrt(pcov[1,1]))


Slope= -0.4010129725637122 +/- 0.007056226722822071 ( -0.4080691992865343 , -0.39395674584089013 )
Intercept= 10.251110169860413 +/- 0.08162270120288012 ( 9.941549243952434 , 10.332732871063293 )

Compare to cov-matrix for fun:
sqsrt(pcov[0,0]) (should be sigma m) 0.02675033478186744
sqrt(pcov[1,1]) (should be sigma b) 0.30956092590797973

Project 7

See this reference for other systems that exhibit power law behavior:

http://arxiv.org/abs/cond-mat/0412004

Or test your intuition and see if it's right. Many system exhibit this behavior!

Get some data, and test your idea. Use the matrix technique to do the fitting.

Matrix Approach

WARNING: THE REST OF THIS NOTEBOOK IS OPTIONAL. YOU DO NOT NEED TO READ FURTHER UNLESS YOU ARE CURIOUS/INTERSTED IN THE INNER MACHINERY OF CURVE FITTING

The remainder of this notebook details an approach to parameter estimation using matrices. If you're interested, you're welcome to use this approach to fit your data. It has the (minor) advantage that there are no "black boxes" like curve_fit. The underlying mathematics is open for your inspection! However it's also pretty mathematically "heavy" so, please use curve_fit if you prefer.

Fitting Data with Matrices

We can see that we have something that looks like it might be a power law. How can we tell? We'll use "linear least squares" analysis.

When fitting some experimental data to a model, there are typically a set of paramters $(\alpha_1, \alpha_2, \ldots, \alpha_n)$ that need to be found so that the function $\chi^2$ is a minimum. We can think of the paramters as an n-vector called $\alpha$. The data consist of a set of paired numbers $(x_i, y_i)$ and we can define $\chi^2$ as

\begin{equation} \chi^2 = \sum_{i=1}^{m} {(y_i - y^\star(x_i,\alpha))^2\over \sigma_i^2}, \end{equation}

where $\sigma_i$ is the uncertainty in $y_i$, and $y^\star$ is the model value corresponding to $x_i$ (and the value of the parameters $\alpha$). If the model $y^\star(x,\vec\alpha)$ is linear then we can write $y^\star$ as:

\begin{equation} y^\star (x) = \alpha_1 X_1(x)+\alpha_2 X_2(x)+\ldots+\alpha_n X_n(x). \end{equation}

or

\begin{equation} y^\star(x) = \sum_{j=1}^{n} \alpha_j X_j(x), \end{equation}

where the $X_j$ are a set of functions that are to be superposed with coefficients $\alpha_j$ to form the model. A common example of this would be a straight line fit $y^\star(x) = mx+b$ where the functions are $x$ and 1 for the paramters $m$ and $b$ respectively.

Notice that, since the $x$'s are at fixed measured values, this whole thing can be cast into a single matrix equation. If we define the following $M$ matrix:

\begin{equation} M_{ij} = X_j(x_i) \end{equation}

and if we think of $y^\star$ as an $m$-vector of model values evaluated at the $m$ data points $x_i$ and $y$ as an experimental $m$-vector $y$, then the model vector can be evaluated by matrix multiplication as:

\begin{equation} y^\star = M \alpha \end{equation}

and the difference vector between the model and the data becomes simply

\begin{equation} y - y^\star = y - M\alpha \end{equation}

Finally we can write $\chi^2$ as

\begin{equation} \chi^2 = (y - M\alpha)^T S (y - M\alpha) \,\,\,\,\,\,\,\,(1) \end{equation}

where $S$ is a diagonal matrix of $1/\sigma_i^2$ and the $T$ on the left side means "transpose".

Now it is a simple matter to find the values of the paramters that minimize $\chi^2$. We require that $\nabla_{\alpha}\chi^2 = 0$. If you play around with this you can convince yourself that this means:

\begin{equation} M^T S(y - M\alpha) = 0 \end{equation}

and therefore:

\begin{equation} \alpha = (M^T SM)^{-1}M^T S y. \,\,\,\,\,\,\,\,(2) \end{equation}

In [18]:
#
# define the function values (these are the columns of the "M" matrix
#
# generate arrays that contain "functions" of x that are combined to form the model
#

def doFit(funcs, xvals, yvals, sigma):
    """
    doFit uses matrices to solve for the linear least squares 
    fit parameters based on the list of 'funcs', and the x-y values
    and their corresponding uncertainties (sigma).
    
    returns the array of model parameters
    """
    #
    # define the S matrix in terms of uncertainties
    #
    
    S=np.diag((1.0/sigma)**2)
    
    #
    # define the model matrix
    #
    
    M = np.array(funcs).T
    MtM = M.T.dot(S.dot(M))
    MtMInv = np.linalg.inv(MtM)
    MtY = M.T.dot(S.dot(yvals))
    alpha = MtMInv.dot(MtY)      # see eq 2 above.
    
    #
    # solve for the parameters, and return
    #
    ystar = M.dot(alpha)
    
    return (alpha, ystar, MtMInv)

fx1 = np.ones(len(xvals))   # X_1(x) = 1
fx2 = xvals              # X_2(x) = x

sigma=1.0/np.sqrt(np.abs(counts))

alpha, ystar, fcov = doFit([fx1, fx2], xvals, yvals, sigma)

mFit = alpha[1]
bFit = alpha[0]

pl.errorbar(xvals, yvals, sigma, fmt='r.') 
pl.title("File Sizes on My HD")
pl.ylabel("log(count)")
pl.xlabel("log(size)")
pl.plot(xvals,ystar, 'g-')


Out[18]:
[<matplotlib.lines.Line2D at 0x12d247990>]

In [23]:
#
# Now do monte-carlo data fabrication and fit analysis
#

N=1000    # number of samples to take
mList=[]  # keep track of monte-carlo fit parameters
bList=[]

for i in range(N):
    """
    Generate mc data with the same statistical properties as the real data.
    Repeat the fit for each set, and record the parameters.
    """
    mcY = ystar + sigma*np.random.normal(size=len(xvals))
    mcAlpha,mcYstar,mccov = doFit([fx1,fx2],xvals,mcY,sigma)   # repeatedly fit mc data
    mList.append(mcAlpha[1])
    bList.append(mcAlpha[0])
    
#
# Compute the statistics of the mc-results
#

marr=np.array(mList)
barr=np.array(bList)

mAvg = marr.sum()/N
bAvg = barr.sum()/N
delM = marr-mAvg
delB = barr-bAvg
sigM = np.sqrt((delM*delM).sum()/(N-1))  # sigM is the std-deviation of the m values
sigB = np.sqrt((delB*delB).sum()/(N-1))  # sigB is the std-deviation of the b values

#
# plot the fit
#
print("Slope=", mFit, '+/-', sigM, "(", mFit - sigM,",",mFit + sigM, ")")
print("Intercept=", bFit, '+/-', sigB, "(", bFit - sigB,",",bFit + sigB, ")")

#
# Just for fun print out the diagonals of the covariance matrix.
#
print # print a blank link
print("Compare to cov-matrix for fun:")
print("sqsrt(cov[1,1]) (should be sigma m)", np.sqrt(fcov[1,1]))
print("sqrt(cov[0,0]) (should be sigma b)", np.sqrt(fcov[0,0]))


Slope= -0.40101297210387443 +/- 0.0069164777066884975 ( -0.40792944981056295 , -0.3940964943971859 )
Intercept= 10.251110164592234 +/- 0.07947579449122216 ( 10.171634370101012 , 10.330585959083455 )
Compare to cov-matrix for fun:
sqsrt(cov[1,1]) (should be sigma m) 0.006919937608028965
sqrt(cov[0,0]) (should be sigma b) 0.08007908396499791

In [24]:
pl.hist(marr) # look at the variation in "m" values
print("m average:", mAvg)
print("m sigma:",sigM)


m average: -0.4009698835771972
m sigma: 0.0069164777066884975

In [25]:
pl.hist(barr)  # look at the variation in "b" values
print("b average:", bAvg)
print("b sigma:",sigB)


b average: 10.250089828310378
b sigma: 0.07947579449122216

In [ ]: